A New Dynamic and Adaptive Scheme for Indexing in Metric Spaces
Umut Tosun
MSc.Student
Computer Engineering Department
Bilkent University
Computer Science applications are often concerned with efficient
storage and retrieval of data. Well defined structure of traditional
databases help to access required query objects effectively using the
Relational Database paradigm. However, in recent times, we are faced
with the challenges of dealing with unstructured and complex data such
as images, video, sound clips and text documents.
Multimedia Information Retrieval, Data Mining, Pattern Recognition,
Machine Learning, Computer Vision, Biomedical Databases are examples
of some of the fields that require efficient management of complex
data. Complex, unstructured type of data often cannot be broken down
into well-defined components, and exact matching cannot be applied for
defining queries. Instead, the notion of similarity search is used
where a query or prototype object is provided by the user and the
database retrieves objects that are similar.
One popular approach for similarity searching is to approximate the
relationship between database objects by mapping them into a vector
space.
There are well-known indexing methods in literature that supports
similarity queries in vector spaces, however, it has been shown that
these methods are ineffective for high dimensional data. Another
approach is to use Metric Spaces model for indexing. Metric spaces are
defined by a distance function that has the triangular inequality
property. Since there are no assumptions about the structure of the
data itself, they constitute a higher level abstraction and thus have
more applicability. They have also been shown to perform better in
higher dimensions.
A lot of the previous work in metric spaces have concentrated on
static methods that do not allow new insertions once the index
structure has been initialized. M-Tree, Slim-Tree, DF-Tree, Omni are
some of the popular dynamic structures. These methods can grow
incrementally by splitting overflowed nodes and adding new levels to
the tree very much like the B-tree variants. Unfortunately, they have
been shown to perform very poorly compared to flat structures such as
AESA, LAESA, Spaghettis and Kvp that use a fixed set of global
pivots. The distances between the query object and the pivots are
computed to eliminate some portion of the database from
consideration. The number of pivots can be easily increased to provide
more selectivity, thus better query performance.
However, there is an optimum number of pivots for a given query
radius, and using too many pivots increases the costs of queries and
the initialization of the index.
Recently, SSS was introduced as a LAESA variant that allows insertions
of new database objects and dynamically promotes some of the new
objects as pivots.
In this thesis, we argue that SSS has fundamental problems that
results in poor query performance for clustered or otherwise skewed
distributions. Real datasets have often been observed to show such
characteristics.
We show that SSS has been optimized to work for a symmetrical,
balanced distribution and for a specific radius value. Our first main
contribution is offering a new pivot promotion scheme that can perform
robustly for clustered or skewed distributions. Our second
contribution is proposing new methods that solves the problem of
determining the right number of pivots for different query radius
values. We show that our new indexing scheme performs significantly
better than tree-based dynamic structures while having lower insertion
costs. We also show that our structure adapts to changes in the
database population in a superior way.
DATE:
17 August, 2007, Friday@ 14:00
PLACE:
EA 409